
Corpus-based Acquisition

In contrast to dictionaries, text corpora are primary sources of information about language use. They can be analyzed using statistical techniques, to derive information about word frequency, co-occurrence, etc. They support detailed studies of how particular words are used by providing extensive examples of natural language sentences in context (Atkins 1991).

However, corpora suffer from the fact that they consist only of surface words -- in many corpora not even syntactic annotations are provided, much less clues about the meaning conveyed by the text. Linguistic pre-processing is necessary to convert corpora into a form useful for lexical acquisition, by identifying parts of speech, syntactic structures, morphologically related words, etc.

Some statistical work has treated MRDs themselves as corpora on which analysis is performed. For example, Tony Plate describes a technique in Wilks 1989 for acquiring data on the co-occurrence of pairs of words in LDOCE. The frequency of co-occurrence of two words (or senses as identified in the dictionary) is argued to be a measure of the strength of the semantic relationship between them. Such data could be useful in solving various NLP problems, such as identification of the topic of a discourse or lexical disambiguation (e.g. Guthrie 1991, Wilks and Stevenson 1997).
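As a rough illustration of the kind of data involved, the sketch below counts how often pairs of words occur together in the same definition. It is a minimal sketch rather than Plate's actual procedure: the function name `cooccurrence_counts`, the whitespace tokenisation, and the toy definitions are assumptions made for the example, and the definitions are invented, not taken from LDOCE.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(definitions):
    """Count, for each unordered word pair, the number of definitions containing both words."""
    counts = Counter()
    for text in definitions:
        # Use each word once per definition and sort so that pairs have a canonical order.
        words = sorted(set(text.lower().split()))
        counts.update(combinations(words, 2))
    return counts


# Toy definitions, invented for illustration (not LDOCE entries).
toy_definitions = [
    "bank money institution",
    "loan money bank",
    "river bank water",
]

counts = cooccurrence_counts(toy_definitions)
for pair, n in counts.most_common(3):
    print(pair, n)
```

On this view a pair with a high count, such as ("bank", "money") here, would be taken to be more strongly related than a pair that never shares a definition.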

Various statistical tools for gathering information from corpora are described in Church (1991) and Church (1994). These are the mutual information test, which measures the similarity of two words; the t-test, which measures the difference between two words; and the substitutability test, which identifies sets of words with similar distributions. The last test might pick out words which are synonyms, antonyms or co-hyponyms. The authors argue that these tests can provide different insights into relations between words, but that "great care and skill will be needed in interpreting the salient features of the sets that are identified" (Church 1994:174). They view the role of these tools as aiding lexicographers in their work (and thereby, it is hoped, improving the quality and accuracy of the information in dictionary definitions), rather than as a basis for lexicon acquisition. However, the tools can also be used in semi-automatic lexical acquisition: they establish candidate relationships, which a human working with the system then evaluates in order to develop a lexicon that reflects the idiosyncrasies of language use.
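The first two tests can be sketched from raw counts. The code below is an illustrative sketch under stated assumptions, not the exact formulation of the cited papers: mutual information is taken in its usual pointwise form, log2 of P(x,y) / (P(x) P(y)), and the t-test is taken in a commonly used approximation that compares how often a word w occurs with each of two words x and y. The function names and all counts are hypothetical.

```python
import math


def pointwise_mutual_information(pair_count, x_count, y_count, total):
    """log2 of the observed joint probability over the probability expected under independence."""
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log2(p_xy / (p_x * p_y))


def t_score_difference(count_with_x, count_with_y):
    """Approximate t-score for the difference between the counts of a word w
    seen with x and the counts of the same word seen with y."""
    return (count_with_x - count_with_y) / math.sqrt(count_with_x + count_with_y)


# Hypothetical counts: x and y each occur 100 times in 10,000 tokens and co-occur 10 times.
pmi = pointwise_mutual_information(10, 100, 100, 10_000)

# Hypothetical counts: w occurs 16 times with x and 9 times with y.
t = t_score_difference(16, 9)
print(round(pmi, 3), round(t, 3))
```

A large positive mutual information value indicates that the pair co-occurs more often than chance would predict, while a t-score far from zero indicates that w prefers one of the two words over the other.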

Other statistical work (e.g. Bruce and Wiebe 1995), however, views the lexicon as a probabilistic model characterising a rich set of relationships between a large number of variables. These models are developed on the basis of training data, including some data tagged for word senses and syntactic structure, and assume a predefined and finite set of sense distinctions for each word. Under the Bruce and Wiebe analysis, the relationships identified on the basis of the training data are further combined with constraints derived from propositional logic expressions of relationships among word senses, such as those which can be derived from WordNet. Thus they use theoretical knowledge to interconnect the statistical knowledge.

Fukumoto and Tsujii (1995), on the other hand, propose to identify semantic classes of verbs entirely on the basis of statistical clustering, following from the premise that semantically similar words appear in similar contexts. They also argue that polysemous verbs can be recognised by splitting a word cluster into two sets and comparing the semantic deviation of the sets: the distributions of each of the two distinct senses of a word will differ. This definition of polysemy, however, is closer to homonymy and does not allow for subtle gradations between the meanings of polysemous verbs, for which the clusters will differ very little.
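The premise that semantically similar words appear in similar contexts can be made concrete with a small sketch. This is not Fukumoto and Tsujii's clustering algorithm or their semantic deviation measure; it only shows the underlying idea, using cosine similarity between context-word counts. The function names, the window size, and the toy sentences are assumptions made for the example.

```python
import math
from collections import Counter


def context_vector(target, sentences, window=2):
    """Count the words appearing within `window` tokens of each occurrence of `target`."""
    vec = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                vec.update(left + right)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


# Toy corpus, invented for illustration.
corpus = [
    "she drinks hot tea",
    "he sips hot tea",
    "they repair old cars",
]

drinks = context_vector("drinks", corpus)
sips = context_vector("sips", corpus)
repair = context_vector("repair", corpus)
print(cosine(drinks, sips), cosine(drinks, repair))
```

Verbs whose context vectors are close, such as "drinks" and "sips" here, would be candidates for the same cluster; a real system would need far more data and a principled clustering step.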

In a similar vein, Zhai (1997) attempts to identify lexical atoms, or two-word idiomatic phrases with a non-compositional meaning, on the basis of several statistical heuristics which measure compositionality. These heuristics compare co-occurrence frequencies, word associations, and context similarity of the two words in a lexical atom independently and together. The idea is that the meaning of a lexical atom [X Y] is radically different from that of X or Y independently, and that the lexical atom will therefore appear in distinct contexts.
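The context-similarity idea can be illustrated as follows. This is a hypothetical sketch, not Zhai's heuristics: it collects the context words of the phrase [X Y] and of X and Y occurring on their own, and compares the sets with Jaccard overlap. The function names, the window size, the use of Jaccard overlap, and the toy sentences are all assumptions made for the example.

```python
def neighbors(tokens, start, length, window):
    """Words within `window` tokens on either side of the span tokens[start:start+length]."""
    left = tokens[max(0, start - window):start]
    right = tokens[start + length:start + length + window]
    return set(left) | set(right)


def phrase_and_word_contexts(sentences, x, y, window=2):
    """Collect context words for the phrase 'x y' and for x and y occurring on their own."""
    phrase_ctx, x_ctx, y_ctx = set(), set(), set()
    for sentence in sentences:
        tokens = sentence.lower().split()
        i = 0
        while i < len(tokens):
            if tokens[i] == x and i + 1 < len(tokens) and tokens[i + 1] == y:
                phrase_ctx |= neighbors(tokens, i, 2, window)
                i += 2  # skip past the phrase so its parts are not counted separately
                continue
            if tokens[i] == x:
                x_ctx |= neighbors(tokens, i, 1, window)
            elif tokens[i] == y:
                y_ctx |= neighbors(tokens, i, 1, window)
            i += 1
    return phrase_ctx, x_ctx, y_ctx


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy sentences, invented for illustration.
sentences = [
    "we ate a hot dog with mustard",
    "the hot sun warmed the sand",
    "the dog chased the ball",
]

phrase_ctx, x_ctx, y_ctx = phrase_and_word_contexts(sentences, "hot", "dog")
print(jaccard(phrase_ctx, x_ctx), jaccard(phrase_ctx, y_ctx))
```

Low overlap between the contexts of the phrase and the contexts of its parts would count as evidence that the phrase is a lexical atom; with realistic data the comparison would be made over frequencies rather than bare sets.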

In sum, statistical techniques are useful for measuring various relationships between words in a corpus, and for predicting semantic connections on the basis of frequent co-occurrence of certain words in different contexts. The work of Fukumoto and Tsujii (1995) and Zhai (1997) points towards the usefulness of corpus analysis for identifying entirely distinct uses of a particular word, but such analysis could not easily be extended to automatic discovery of closely related uses of a word or of productive lexical rules. No statistical technique can identify subtle distinctions of meaning or usage. None of these acquisition techniques will result in a generative lexicon which captures the regular relationships between groups of words and addresses the productivity of the lexicon, so the range of NLP systems in which they can be of use is limited. In addition, the observations made by Church (1994) and the techniques proposed by Bruce and Wiebe (1995) indicate that the information mined from a corpus can most effectively be applied within a theoretical framework which structures and guides the interpretation of the data.

